Arabic Morphosyntactic Raw Text Part of Speech Tagging System: PhD Dissertation, Chapter 4
Abstract
We present a comprehensive Arabic tagging system that covers every step from raw text to tag disambiguation. For each processing step, we analyze the existing solutions (if any) and either adopt one of them or propose, implement, and evaluate a new one. The work began with the design of a new Arabic tagset suitable for both Classical Arabic (CA) and Modern Standard Arabic (MSA). In addition to the classical constructions found in tag systems, we introduce interleaving of tags. Interleaving is a relation between tags that, in certain situations, can be attached to the same occurrence of a word, although each of them can also appear alone. Our tagset makes this relation explicit. We then deal with the preparatory stages of the tagging system. The first stage is tokenization and segmentation, for which we use rule-based and statistical methods. The second stage is morphological analysis and extraction of the lemma from the word. We have created our own analyzer to meet our requirements. Its main part is a dictionary that provides the features, POS, and lemma for each word. The last part of our work is the tagging algorithm, which produces one tag for each word. We use a hybrid method that combines rule-based and statistical methods. Three taggers, Hidden Markov Model (HMM), maximum match, and Brill, are combined by a new method that we call master and slaves. A handwritten rule-based tagger is then added to the master-slaves combination. The rule-based tagger eliminates incorrect tags, and the master chooses the best of the remaining ones, assisted by the slaves. Our complete system is ready to be used for the annotation of Arabic corpora.
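The abstract does not say how the master, the slaves, and the rule-based tagger interact in detail, so the following Python sketch is only one plausible reading of the final disambiguation step. The function name `combine`, its parameters, the tag names, and the fallback policy (keep the master's tag if it survives the rules, otherwise take a majority vote of the slaves) are all assumptions made for illustration and are not taken from the dissertation.

```python
from collections import Counter


def combine(candidates, is_rejected, master_tag, slave_tags):
    """Pick one tag for a word (hypothetical policy, not the author's).

    candidates  : tags the analyzer proposes for this word
    is_rejected : function returning True for tags the handwritten rules eliminate
    master_tag  : tag proposed by the master tagger
    slave_tags  : tags proposed by the slave taggers
    """
    # Step 1: the rule-based tagger eliminates incorrect tags.
    remaining = [t for t in candidates if not is_rejected(t)]
    if not remaining:
        # Assumption: if the rules reject everything, ignore them.
        remaining = list(candidates)
    if not remaining:
        # Assumption: with no candidates at all, trust the master.
        return master_tag

    # Step 2: the master's choice wins if it survived the rules.
    if master_tag in remaining:
        return master_tag

    # Step 3: otherwise the slaves vote among the surviving tags.
    votes = Counter(t for t in slave_tags if t in remaining)
    if votes:
        return votes.most_common(1)[0][0]
    return remaining[0]


if __name__ == "__main__":
    cands = ["NOUN", "VERB", "PART"]
    no_verb = lambda tag: tag == "VERB"  # toy stand-in for handwritten rules
    print(combine(cands, no_verb, "VERB", ["NOUN", "PART", "NOUN"]))
```

In this toy call the master's proposal is eliminated by the rule, so the choice falls to the slaves. A real system would more likely weigh tagger confidence than count votes.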
Similar resources
Arabic Morphosyntactic Raw Text Part of Speech Tagging System
Introduction and Overview: The topic of this dissertation is morphosyntactic part of speech tagging (abbreviated POS tagging) for Arabic. This topic has a long and rich history for other languages, mainly English. POS tagging provides fundamental information about the word forms used in sentences of natural language. The method of utilizing this information varies depending on the particular NLP ...
Joint Prediction of Morphosyntactic Categories for Fine-Grained Arabic Part-of-Speech Tagging Exploiting Tag Dictionary Information
Part-of-speech (POS) tagging for morphologically rich languages such as Arabic is a challenging problem because of their enormous tag sets. One reason for this is that in the tagging scheme for such languages, a complete POS tag is formed by combining tags from multiple tag sets defined for each morphosyntactic category. Previous approaches in Arabic POS tagging applied one model for each morph...
Lemmatization and Morphosyntactic Tagging of Croatian and Serbian
We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models stem from a new manually annotated SETIMES.HR corpus of Croatian, based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the two Wikipedias. Lemmatization accuracy for...
Multi-source morphosyntactic tagging for spoken Rusyn
This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as tagging toolki...
Morphosyntactic Tagging of Slovene Legal Language
Part-of-speech tagging or, more accurately, morphosyntactic tagging, is a procedure that assigns to each word token appearing in a text its morphosyntactic description, e.g. “masculine singular common noun in the genitive case”. Morphosyntactic tagging is an important component of many language technology applications, such as machine translation, speech synthesis, or information extraction. In...